# **Synthesis for Customized Computing**

#### **Jason Cong**

Chancellor's Professor, UCLA
Director, Center for Domain-Specific Computing

cong@cs.ucla.edu http://cadlab.cs.ucla.edu/~cong



# Why Customized Computing?

| AES 128bit key<br>128bit data | Throughput     | Power  | Figure of Merit<br>(Gb/s/W) |  |
|-------------------------------|----------------|--------|-----------------------------|--|
| 0.18mm CMOS                   | 3.84 Gbits/sec | 350 mW | 11 (1/1)                    |  |
| FPGA [1]                      | 1.32 Gbit/sec  | 490 mW | 2.7 (1/4)                   |  |
| ASM StrongARM [2]             | 31 Mbit/sec    | 240 mW | 0.13 (1/85)                 |  |
| ASM Pentium III [3]           | 648 Mbits/sec  | 41.4 W | 0.015 (1/800)               |  |
| C Emb. Sparc [4]              | 133 Kbits/sec  | 120 mW | 0.0011 (1/10,000)           |  |
| Java [5] Emb. Sparc           | 450 bits/sec   | 120 mW | 0.0000037 (1/3,000,000)     |  |

Source: P Schaumont and I Verbauwhede, "Domain specific codesign for embedded security," IEEE Computer 36(4), 2003

#### How to Improve the Efficiency? Our Proposal - Customized Computing with Accelerator-Rich Architectures

#### A customizable heterogeneous platform (CHP)

- With a sea of dedicated and composable accelerators
- Most computations are carried on accelerators not on processors!

#### A fundamental departure from von Neumann architecture

#### Why now?

- Previous architectures are device/transistor limited
- Von Neumann architecture allows maximum device reuse
  - · One pipeline serves all functions, fully utilized

#### Future architectures

- Plenty of transistors, but power/energy limited (dark silicon)
- Customization and specialization for maximum energy efficiency

#### A story of specialization

<sup>[1]</sup> Amphion CS5230 on Virtex2 + Xilinx Virtex2 Power Estimator
[2] Dag Arne Osvik: 544 cycles AES – ECB on StrongArm SA-1110
[3] Helger Lipmaa Pill assembly handcoded + Intel Pentium III (1.13 GHz) Datasheet
[4] gcc., 1 mWMHz @ 120 Mhz Sparc – assumes 0.25 u CMOS

<sup>[5]</sup> Java on KVM (Sun J2ME, non-JIT) on 1 mW/MHz @ 120 MHz Sparc – assumes 0.25 u CMOS

#### Lessons from Nature: Human Brain and Advance of Civilization

- ◆ High power efficiency (20W) of human brain comes from specialization
  - Different region responsible for different functions
- Remarkable advancement of civilization also from specialization
  - More advanced societies have higher degree of specialization



#### **UCLA** Newsroom

Hom

UCLA Newsroom > All stories > News Releases

#### All Stories

All Stories Featured News

News Releases

Advisories Images Multimedia

Research Health Sciences Arts & Humanities

Student Affairs Academics & Faculty

Campus News

Media Contacts



For the Media

Contacts News releases Advisories About UCLA

# NSF awards UCLA \$10 million to create customized computing technology

By Wileen Wong Kromhout| 8/11/2009 9:45:00 AM

The UCLA Henry Samueli School of Engineering and Applied Science has been awarded a \$10 million grant by the National Science Foundation's Expeditions in Computing program to develop high-performance, energy efficient, customizable computing that could revolutionize the way computers are used in health care and other important applications.

In particular, UCLA Engineering researchers will demonstrate how the new technology, known as domain-specific computing, could transform the role of medical imaging and hemodynamic simulation, providing more cost-effective and convenient solutions for preventive, diagnostic and therapeutic procedures and dramatically improving health care quality, efficiency and patient outcomes.

"This significant award is another testament to the world-class faculty here at UCLA who continue to push the envelope to solve society's most pressing issues," said UCLA Chancellor Gene Block. "We are grateful to the NSF, which has repeatedly provided crucial funding to our faculty, helping to place the university among the nation's top five in research funding."

In an effort to meet ever-increasing computing needs in various fields, the computing industry has entered an "era of parallelization," in which tens of thousands of computer servers are connected in warehouse-scale data centers, said Jason Cong, the Chancellor's Professor of Computer Science and director of the new UCLA Center for Domain-Specific Computing (CDSC), which will oversee the research. But these parallel, general-purpose computing systems still face serious challenges in terms of performance, energy, space and cost.

Domain-specific computing holds significant advantages, Cong said. While general-purpose computing region computing region computing region computing utilizes a customizable architecture and custom-oriented, high-level computer languages tailored to a particular application area or domain — in this case, medical imaging and hemodynamic modeling. This customization ultimately results in much less energy consumption, faster results, lower costs and increased productivity.

The goal of the new UCLA center, Cong said, is to look beyond parallelization and focus on domainspecific customization to bring significant power-performance efficiency improvement to important application domains.

#### Levels of Customization

- ◆ Single-chip level
  - Require new processor designs, e.g. using composable accelerators [ISLPED' 12, DAC'14]
- Server node level
  - Host CPU + FPGA via PCI-e or QPI connections
- Data center level
  - Clusters of heterogeneous computing nodes

#### Levels of Customization

- ◆ Single-chip level
  - Require new processor designs, e.g. using composable accelerators [ISLPED' 12, DAC'14]
- ◆ Server node level
  - Host CPU + FPGA via PCI-e or QPI connections
- ◆ Data center level
  - Clusters of heterogeneous computing nodes





# 5 Years of Accelerating Medical Image Processing

|                         | 2010              | 2013                        | 2015 (Today)                            |
|-------------------------|-------------------|-----------------------------|-----------------------------------------|
| CT image reconstruction | 18 hours          | 20 minutes                  | 6 minutes                               |
|                         | Single thread CPU | FPGA acceleration on Convey | 4 Virtex-6 FPGAs on Convey w/data reuse |
| Denoising               | 5 minutes         | 15 seconds                  | 3 seconds                               |
|                         | Single thread CPU | NVidia GPU                  | Core i7 Haswell, OpenMP, stencils       |
| Registration            | 10 minutes        | 2 minutes                   | 30 seconds                              |
|                         | Single thread CPU | NVidia GPU                  | Core i7 Haswell, OpenMP, stencils       |
| Segmentation            | 20 minutes        | 4 minutes                   | 1 minute                                |
|                         | Single thread CPU | Multithread CPU             | Core i7 Haswell, OpenMP, stencils       |
| Analysis                | 45 minutes        | 18 minutes                  | 5 minutes*                              |
|                         | Single thread CPU | Multithread CPU             | Core i7 Haswell, OpenMP                 |







Workstation

CPU + GPU

FPGA, CPU platform

#### **Levels of Customization**

#### ■ Single-chip level

 Require new processor designs, e.g. using composable accelerators [ISLPED' 12, DAC'14]

#### ■ Server node level

■ Host CPU + FPGA via PCI-e or QPI connections

#### ■ Data center level

- Clusters of heterogeneous computing nodes
- How about programming at data center level?





#### Accelerating Large-Scale Services - Bing Search

1,632 Servers with FPGAs Running Bing Page Ranking Service (~30,000 lines of C++)



A. Putnam, "A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services", ISCA'2014

15









# How to Program Such "Beasts"?

-- "Write Once, Compile Anywhere"





#### AutoPilot Results: Optical Flow (from BDTI)

- Application
  - Optical flow, 1280x720 progress scan
  - Design too complex for an RTL team
- Compared to high-end DSP:
  - 30X higher throughput, 40X better cost/fps

|                                                       | Chip<br>Unit<br>Cost | Highest Frame<br>Rate @ 720p<br>(fps) |     | Cost/performance<br>(\$/frame/second) |  |
|-------------------------------------------------------|----------------------|---------------------------------------|-----|---------------------------------------|--|
| Xilinx<br>Spartan3ADSP<br>XC3SD3400A<br>chip          | \$27                 |                                       | 183 | \$0.14                                |  |
| Texas<br>Instruments<br>TMS320DM6437<br>DSP processor | \$21                 |                                       | 5.1 | \$4.20                                |  |



BDTi evaluation of AutoPilot http://www.bdti.com/articles/AutoPilot.pdf

23

# **Vivado High-Level Synthesis: Accelerated IP Development and Design Space Exploration**

- > C libraries:
  - · Arbitrary precision
  - Floating-point math.h
  - · OpenCV video functions
  - DSP
  - Linear algebra
- > Accelerated verification
  - >100X faster than RTL design
- Fast compilation and design exploration
  - Algorithm feasibility
  - Architecture Iteration



|                        | RTL         | Vivado HLS  |
|------------------------|-------------|-------------|
| Design Time<br>(weeks) | 12          | 1           |
| Latency<br>(ms)        | 37          | 21          |
| Memory<br>(RAMB18E1)   | 134 (16%)   | 10 (1%)     |
| Memory<br>(RAMB36E1)   | 273 (65%)   | 138 (33%)   |
| Registers              | 29686 (9%)  | 14263 (4%)  |
| LUTs                   | 28152 (18%) | 24257 (16%) |

#### Customer proven results

Page 24

© Copyright 2013 Xilinx

**₹ XILINX** ➤ ALL PROGRAMMABLE.

#### **Production-Proven and Adopted by 1000+ companies**

- "Vivado® HLS enabled easy and fast implementation of 768x768 QRD single precision floating-point design. We like this tool QoR, productivity and flexibility and will deploy in to more production designs."
- ➤ "Vivado HLS made it possible for DSP software engineers to implement LTE layer 1 switch on Zynq® SoCs by enabling us to target more than 500K lines of C code."
- ➤ "We developed C++ DSP functions using Vivado HLS and the results met size/speed goal for commercial platform deployment on Virtex®-7."

© Copyright 2013 Xilinx

**₹ XILINX** ➤ ALL PROGRAMMABLE.

#### **HLS Alone Is Not Enough for Customized Computing**

- A large design space for software and hardware co-design
- Also need automated source code transformation for HLS friendly C/C++ code to enable
  - Concurrent memory access
  - Data reuse and on-chip buffer generation
  - Data prefetching













# Linear-Transformation-Based Partitioning\* ■ Linear transformation-based approach ■ Multidimensional address $\vec{x}$ linearization: $L(\vec{x}) = \vec{\alpha} \cdot \vec{x}$ ■ Bank mapping: bank( $\vec{x}$ )= $L(\vec{x})$ mod N (Cyclic) ■ Example: denoise x1 x1 Bank2 Bank3 Bank4 Bank0 Bank1 x0 color:banks $\forall i,j, A[a_1*i+b_1][c_1*j+d_1] \text{ not conflict with } A[a_1*i+b_1][c_1*j+d_1]$ $\Rightarrow \gcd((\alpha_1 (a_1-a_2)+\alpha_2 (c_1-c_2)), N) \nmid (\alpha_1 (b_1-b_2)+\alpha_2 (d_1-d_2))$ "Y. Wang, P. Li, P. Zhang, C. Zhang and J. Cong. DAC 2013



# **Conflict Polytope**

 Conflict polytope of two references is a subset of iteration domain where the two references are mapped on the same bank

$$\begin{cases} 1 \le i \le n \\ 1 \le j \le n \end{cases}$$
 for (i = 1; i <= n; i++)  
 for (j = 1; j <= min(n,n-i+2); j++)  
 foo(A[j+1][i+1], A[j][2\*i]);

■ Insert an extra variable k to linearize

$$(i+1)+(j+1)=(2i+j)+3k$$

 $(i+1)+(j+1) \equiv (2i+j) \mod 3$ 

- Fourier-Motzkin Algo. (Fourier 1826, Motzkin1936)
  - Test the emptiness of the conflict polytope
  - Algorithm complexity:  $O(m^{2^i})$ 
    - m: number of inequalities (m=4 in the example)
    - t: number of of variables (t=3 in the example)
    - independent of iteration domain size n



#### Latest Research - Automating Customized Computing Polyhedral-Based Data Reuse Input Code(C/C++) **Optimization for Configurable Computing** FPGA'13 Best Paper Award **Program Analysis Loop Structure** Improving Polyhedral Code Generation Loop Optimization for High-Level Synthesis CODES-ISSS'14 Restructuring **Best Paper Award Code Generation** Theory and Algorithm for Generalized Memory Partitioning in High-Level **Array Partitioning** Synthesis, FPGA'14 **Data Layout** Optimization An optimal microarchitecture for stencil **Data Reuse** computation acceleration based on nonuniform partitioning of data reuse buffers Module Selection/ (DAC'14) replication Inter-Module **Combining Computation with** Communication Optimization **Communication Optimization in System** Optimization Synthesis for Streaming Applications, Module-level FPGA'14 Scheduling 36

# Motivation

Tile size: 32x32 Image: 64x64, 4 tiles



#### ■ Which implementation to use for each module?

■ Memory partitioned v.s. non-memory-partitioned

|                          | BRAM | DSP | FF    | LUT   |
|--------------------------|------|-----|-------|-------|
| non-partitioned gradient | 128  | 21  | 2511  | 2125  |
| partitioned gradient     | 176  | 56  | 7147  | 7262  |
| partitioned rician       | 128  | 22  | 4692  | 3991  |
| non-partitioned rician   | 176  | 88  | 14475 | 15537 |

37

#### **Motivation**

Tile size: 32x32

Image: 64x64, 4 tiles



tile 3

tile 2

- How many number of replicas?
- Scheduling and Communication cost (number of tiles in the communication channel)?



tile 0 / tile 0 gradient rician

tile 3

38

tile 2

scheduling 0 →1 tile

scheduling 0 →2 tiles



#### Formulation (1/2)

#### ■ Derive a scheduling graph

- Associate each node with a time variable, denoting the starting time of the node
- Scheduling graph: delineates all the scheduling constraints
  - Module latency, Module replication, System throughput requirement, Buffer constraints



Module latency constraints et<sub>a</sub>: execution time of task a et<sub>b</sub>: execution time of task b

#### **Buffer Constraints**

If buffer size between a and b is 2, then add edges:  $b^0 \rightarrow a^2 \ b^1 \rightarrow a^3$ 

## Formulation (2/2)

- Associate each node with a scheduling variable
  - $t(b^0) t(a^0) >= et_a$
  - $t(a^2) t(b^0) >= et_b$
  - ...
  - Scheduling variables are integer variables



- Schedulability checking problem is a <u>System of Difference</u> <u>Constraints (SDC)</u> problem
  - It can be solved optimally in polynomial time by linear programming relaxation
  - And the solution is guaranteed to be integers

41

# **Exploration**



#### Scheduling Graph

- Find the length of longest path (maxL)
- In this example, maxL = 8

#### Find critical paths

- Find all the paths whose lengths are *maxL*,
- or more aggressively, (1-ε)\*maxL

#### Module Improvement

- Associate each edge a new weight – the area penalty to remove this edge from the critical paths
- Find a minimum cut on the graph



## **Experiments on Example Denoise**

- Our methodology: ST-Syn
  - computation & communication co-optimization
- Separate
  - separate computation opt. + communication opt.
- → Communication and computation should be considered in a unified framework



# More is Needed for Data Center Level Deployment

45

## Scalable Big-Data Programming

- Simplified programming models
  - MapReduce, Dataflow
- User-transparent Runtime
  - Distributed computing
  - Scheduling and resource management
  - Fault-tolerance





Spark Worker







# Challenges and Opportunities for Logic Synthesis

-- Ultra Fast Synthesis

#### Example Neptune Inc. (2011-2013)

Fast FPGA Place/Route Solution Provider



Vision: Enable Ultra Fast FPGA Compilation



- Delivered Placer 20x speedup vs. Xilinx tool
- Tightly integrated with Xilinx ISE and Vivado environment
- Supported advanced Virtex-6 and Virtex-7 Architecture
- Triggered tremendous interest in emulation and reconfigurable computing
- Acquired by Xilinx in 2013







# Can We Do the Same or Better for Logic Synthesis?

Goal: Synthesize 1M Cells Per Minute?

Utilizing cloud computing, GPU/FPGA acceleration, etc?

# **Concluding Remarks**

- New era of computing
  - Accelerator-centric computing
  - Need efficient support for customization and specialization
- Customization at all levels
  - Chip-level
  - Server node level
  - Data center level
- Data center level customization holds great promise
  - That's where workload aggregates

53

# Concluding Remarks (Cont'd)

- Software support is critical
  - Programming models
    - Hadoop/MapReduce or SPARK (+ C/C++), OpenMP, OpenCL,, ...
  - Fast and simple compilation
  - Runtime management
- Need critical mass for FPGA acceleration libraries

#### Acknowledgements

- Center for Domain-Specific Computing (CDSC) under the supports NSF Expeditions in Computing Program, Fujitsu, Intel, and Mentor Graphics under the NSF InTrans Program
- C-FAR Center under the STARnet Program
- CDSC faculty:







Baraniuk (Rice)



Bui (UCLA)



Chang (UCLA)



Cheng (UCSB)



Cong (Director) (UCLA)



Palsberg (UCLA)



Potkonjak (UCLA)



Reinman (UCLA)



Sadayappan (Ohio-State)



Sarkar (Associate Dir) (Rice)



(UCLA)

55

# More Acknowledgements

# -- Postdocs, Graduate Students, and Collaborators



**Prof. Deming Chen** (UIUC/ADSC)



Yuting Chen (UCLA)



**Hui Huang** (UCLA)



Muhuan Huang (UCLA/Falcon Comp)



Dr. Peng Li (UCLA)



(Falcon Comp)



Dr. Peichen Pan Prof. Louis-Noël Pouchet (UCLA)



Yuxin Wang (PKU)



Di Wu (UCLA/Falcon Comp)



Bingjun Xiao (UCLA)



Hao Yu (UCLA)



Dr. Peng Zhang (Falcon Comp.)



Yi Zou (UCLA)



Wei Zuo (UIUC)